What is this?

This is a notebook that will show how to do some very basic frequentist inferential statistics in R - a binomial test, a Chi Square test, and an independent two group t-test. This is by no means an exhaustive exercise in inferential statistics, and the data we’re using was collected to illustrate the use of the tests rather than reveal some heretofore undiscovered truth about humanity.

We did a survey of three very simple questions:

  • Do you pronounce the g in gif as in giant or as in gig? -> Categorical
  • Do you prefer Facebook or Twitter? -> Categorical
  • On a scale of 0-100, how sick are you of staying home? -> Continuous/Ratio

You can find the survey here. I’ve declared this endeavor Prawn Gif because it’s mainly about the pronunciation of the word gif (abbreviated as PronGif). You can find the notebook, data, etc here on RStudio Cloud.

First, we’ll start by reading in the data, which is in tab-separated format. (note that if you’ve just done the survey yourself, this data won’t include your responses, as it was pulled on March 25, 2020)

dem<-read.csv("DataDem.tsv",header=T, sep="\t")
head(dem)
##            Timestamp PronGif PrefSocial SickHome
## 1 3/23/2020 19:04:13     gig    Twitter       14
## 2 3/23/2020 19:05:17   giant    Twitter        0
## 3 3/23/2020 19:05:36     gig    Twitter       40
## 4 3/23/2020 19:05:47     gig    Twitter       50
## 5 3/23/2020 19:06:33     gig    Twitter       20
## 6 3/23/2020 19:06:42     gig    Twitter        0
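One thing to watch for: depending on your R version, the two text columns may come in as plain character vectors rather than factors (from R 4.0 on, read.csv() defaults to stringsAsFactors = FALSE). A small sketch, assuming dem has been loaded as above, that makes the conversion explicit so functions like summary() and table() report category counts:

```r
# In R >= 4.0, read.csv() no longer converts strings to factors by default,
# so make the two categorical columns explicit factors
dem$PronGif    <- factor(dem$PronGif)
dem$PrefSocial <- factor(dem$PrefSocial)
str(dem)  # confirm the column types
```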

Describing the data

What does the data look like? We know from the original summary within the survey that most people prefer a voiced velar stop (as in gig, ~75%) over an affricate (as in giant, ~25%) when pronouncing gif, and they prefer Twitter (~80%) over Facebook (~20%). This is probably a sampling issue due to my personal social network sizes on the two platforms (most of the participants came from me just asking people who followed me to fill out the form).
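Those rough percentages can be checked directly from the data. A quick sketch, again assuming dem is loaded as above:

```r
# Proportion of responses in each category
prop.table(table(dem$PronGif))     # roughly 75% gig vs 25% giant
prop.table(table(dem$PrefSocial))  # roughly 80% Twitter vs 20% Facebook
```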

What we don’t know is how these intersect - in other words, how users of Facebook pronounce gif relative to users of Twitter, so let’s have a look at this.

library(ggplot2)  # plotting functions used throughout

p<-ggplot(data=dem, aes(x=PronGif,fill=PrefSocial))+
  geom_bar(position="fill")+
  ylab("Proportion")+
  xlab("Pronunciation Preference")+
  theme_bw()

p

We also haven’t looked at whether people’s fatigue with being home varies meaningfully between pronunciation styles or social media network preferences, so let’s have a look at that too:

p2<-ggplot(data=dem,aes(x=PronGif,y=SickHome))+
  geom_violin(aes(fill=PronGif))+
  xlab("Pronunciation Preference")+
  ylab("How sick are you of being at home?")+
  theme_bw()

p2

p3<-ggplot(data=dem,aes(x=PrefSocial,y=SickHome))+
  geom_violin(aes(fill=PrefSocial))+
  xlab("Preferred Social Media Network")+
  ylab("How sick are you of being at home?")+
  theme_bw()

p3

Binomial Test

The binomial test is one of the simplest tests out there. It takes two categories and determines if the counts in those categories differ from what we would consider to be the null hypothesis. In this case, we’ll use 50/50. In other words, our null hypothesis is that people have no particular preference between the velar stop (gig) and affricate (giant) pronunciations of gif.

We use the binom.test() function.

#summary(dem$PronGif)

#giant   gig 
#   53   160 

binom.test(160,213,0.5)
## 
##  Exact binomial test
## 
## data:  160 and 213
## number of successes = 160, number of trials = 213, p-value = 1.127e-13
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.6875007 0.8077126
## sample estimates:
## probability of success 
##              0.7511737

This shows (as we could have guessed from a look at the proportions) that this data is very unlikely to come from a population that has no particular preference in pronunciation. In fact, if people truly had no preference in how they pronounced gif, the chance of sampling a split at least as lopsided as 160 velar stops to 53 affricates is only about one in nine trillion (p = 1.127e-13).
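Under the hood, the exact test is just summing binomial probabilities. Because the null here (p = 0.5) is symmetric, the two-sided p-value can be reproduced by doubling the probability of the smaller tail:

```r
# P(X <= 53) + P(X >= 160) for X ~ Binomial(213, 0.5);
# by symmetry this is twice the lower tail
2 * pbinom(53, size = 213, prob = 0.5)  # matches the p-value reported above
```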

Chi Square test

A chi square test will tell us whether two groups differ from each other in terms of their response or membership for some other categorical variable. For our data, we can look at whether people who prefer Facebook differ in terms of how they prefer to pronounce gif from people who prefer Twitter. The null hypothesis would be that we won’t find any difference in pronunciation between the two groups.

First, we need to convert this to a contingency table that gives us the number of Facebook users that prefer giant and gig, and likewise for Twitter:

demTab<-table(dem$PronGif,dem$PrefSocial)
demTab
##        
##         Facebook Twitter
##   giant       12      41
##   gig         31     129

Our null hypothesis here would be that there is no difference in pronunciation of the word gif across different social media networks. The chi square test is going to tell us if the proportion of giant pronouncers is more or less the same across Facebook and Twitter (i.e., comes from a null hypothesis world where preferred social media network has nothing to do with gif pronunciation), or if it is different across Facebook and Twitter (comes from a world that would support our hypothesis that there is some systematic difference here).

chisq.test(demTab)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  demTab
## X-squared = 0.099888, df = 1, p-value = 0.752

These results, p=0.752, mean that if the null hypothesis were true (no systematic difference in pronunciation across groups), we would see counts at least this far from the expected ones about 75% of the time. That is nowhere near surprising enough to reject the null hypothesis. In other words, which social media network you prefer probably has nothing to do with how you pronounce gif.
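As a sanity check, you can rebuild the same contingency table by hand from the counts above and inspect the expected counts. The chi square approximation is usually considered safe when every expected count is at least 5, which holds here. A sketch:

```r
# Rebuild the 2x2 table from the observed counts
demTab <- matrix(c(12, 31, 41, 129), nrow = 2,
                 dimnames = list(c("giant", "gig"), c("Facebook", "Twitter")))
chisq.test(demTab)           # same X-squared and p-value as above
chisq.test(demTab)$expected  # counts we'd expect if the groups didn't differ
```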

t-test

Here we’ll perform an independent two group t-test. The null hypothesis here would be that the mean of a continuous variable (in this case SickHome) does not differ between two groups (in this case, either Facebook and Twitter users or gig/giant pronouncers). There are other kinds of t-tests we won’t cover here, but note that they always deal with differences in means, and therefore require a continuous variable.
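For reference, the other common variants differ only in their arguments. A sketch using R’s built-in sleep dataset (ten subjects measured under two drugs), since our survey data has no paired structure:

```r
# Welch two-sample t-test (R's default; does not assume equal variances)
t.test(extra ~ group, data = sleep)

# Classic Student's t-test, which assumes equal variances in both groups
t.test(extra ~ group, data = sleep, var.equal = TRUE)

# Paired t-test: the same ten subjects were measured under both drugs
t.test(sleep$extra[sleep$group == 1],
       sleep$extra[sleep$group == 2], paired = TRUE)
```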

# independent 2-group t-test
x<-t.test(dem$SickHome~dem$PronGif) # y~x formula, where y (SickHome) is numeric and x is a binary categorical
y<-t.test(dem$SickHome~dem$PrefSocial)
x
## 
##  Welch Two Sample t-test
## 
## data:  dem$SickHome by dem$PronGif
## t = -0.094184, df = 90.717, p-value = 0.9252
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.540219   9.585973
## sample estimates:
## mean in group giant   mean in group gig 
##            43.91038            44.38750
y
## 
##  Welch Two Sample t-test
## 
## data:  dem$SickHome by dem$PrefSocial
## t = 1.6789, df = 64.679, p-value = 0.098
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.750482 20.209073
## sample estimates:
## mean in group Facebook  mean in group Twitter 
##               51.63488               42.40559

The test for SickHome~PronGif has a very high p-value (p > 0.9), meaning that if velar stop pronouncers were in fact no more or less sick of being at home than affricate pronouncers, we would still see a difference in group means at least this large over 90% of the time. There is no evidence of a difference here.

The second test, for SickHome~PrefSocial, has a much lower p-value of 0.098, but this is still higher than the usual critical \(\alpha\) = 0.05. Looking at the group mean estimates at the bottom of the test output, we can see that Facebook users are slightly more sick (51.6) of being at home than Twitter users (42.4). The p-value means that there is about a 10% chance of getting a difference at least this large from a population where there is actually no difference at all. Usually researchers consider this chance to be too high, and so can’t reject the null hypothesis. However, some fields use an \(\alpha\) = 0.1, and others who did use \(\alpha\) = 0.05 might refer to values of p < 0.1 as “marginally significant”, meaning they are suggestive of support for the hypothesis, but inconclusive (further work is needed).

Something that requires caution is that we did a bit of digging here: if we test every categorical variable we can think of against SickHome (of course, we only had two), we’ll eventually find one that looks like it’s systematically different across groups. Remember that the p-value represents the probability of finding a result at least as extreme as yours given that the null hypothesis is true, and the more you query the same data, the more likely you become to accidentally find support for your hypothesis even though the null hypothesis shouldn’t be rejected (i.e., you become more prone to Type I errors).
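One standard way to guard against this is to adjust the p-values for the number of tests performed. A sketch using p.adjust() on the two p-values from above (Bonferroni simply multiplies each p-value by the number of tests, capped at 1; Holm is a slightly less conservative stepwise variant):

```r
pvals <- c(PronGif = 0.9252, PrefSocial = 0.098)
p.adjust(pvals, method = "bonferroni")  # 0.9252 * 2 caps at 1; 0.098 * 2 = 0.196
p.adjust(pvals, method = "holm")
```

After even this mild correction, the "marginal" PrefSocial result looks considerably less persuasive.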